High Performance Linear System Solver with Resilience to Multiple Soft Errors

Authors

  • Peng Du
  • Piotr Luszczek
  • Jack Dongarra
Abstract

In the multi-peta-flop era for supercomputers, the number of computing cores is growing exponentially. However, with integrated circuit technology scaling below 65 nm, the critical charge required to flip a gate or a memory cell is dangerously reduced. Combined with higher vulnerability to cosmic radiation, soft errors are expected to become all but inevitable for modern supercomputer systems. As a result, for long-running applications on high-end machines, including linear solvers for dense matrices, soft errors have become a serious concern. The classical checkpoint and restart (C/R) scheme loses effectiveness against this threat because soft errors are difficult to detect: they take the form of transient bit flips that do not interrupt program execution and therefore leave no trace of their occurrence. Current research on soft error resilience for dense linear solvers offers limited capability when faced with large-scale computing systems that suffer both from round-off error in floating-point arithmetic and from the occurrence and subsequent propagation of multiple soft errors. The use of error-correcting codes based on Galois fields incurs a high computational cost for recovery. This work proposes a fault-tolerant algorithm for a dense linear system solver that is resilient to multiple spatial and temporal soft errors. The algorithm is designed to work with floating-point data and is capable of recovering the solution of Ax = b from multiple soft errors that affect any part of the matrix during computation. Additionally, the computational complexity of error detection and recovery is optimized through novel methods. Experimental results on cluster systems confirm that the proposed fault tolerance functionality can successfully detect and locate soft errors and recover the solution of the linear system. The performance impact is negligible, and the performance of the soft-error-resilient algorithm scales well on large-scale systems.
Keywords: soft error; fault tolerance; multiple errors; dense linear system solver
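
The abstract does not spell out the proposed algorithm, so the following is only a minimal sketch of the general idea behind floating-point checksum protection, assuming the classic weighted-checksum scheme from algorithm-based fault tolerance rather than the authors' method. It handles a single corrupted entry in a static matrix, whereas the paper targets multiple errors that propagate during the solve. The function names (`column_checksums`, `detect_and_correct`, `flip_bit`) and the tolerance value are hypothetical choices made for this illustration.

```python
import struct


def column_checksums(a):
    """Return (plain, weighted) column checksums of a square matrix."""
    n = len(a)
    plain = [sum(a[i][j] for i in range(n)) for j in range(n)]
    # Row i is weighted by i + 1 so that the ratio of the two checksum
    # differences reveals which row was corrupted.
    weighted = [sum((i + 1) * a[i][j] for i in range(n)) for j in range(n)]
    return plain, weighted


def detect_and_correct(a, plain, weighted, tol=1e-9):
    """Locate and repair at most one corrupted entry, in place.

    Returns (row, col) of the repaired entry, or None when the checksums
    still match within the relative tolerance (an assumed value).
    """
    new_plain, new_weighted = column_checksums(a)
    for j in range(len(a)):
        d1 = new_plain[j] - plain[j]          # equals the injected error
        if abs(d1) <= tol * max(1.0, abs(plain[j])):
            continue                          # consistent up to round-off
        d2 = new_weighted[j] - weighted[j]    # equals (row + 1) * error
        row = round(d2 / d1) - 1
        a[row][j] -= d1                       # subtract the error back out
        return row, j
    return None


def flip_bit(x, bit):
    """Flip one bit of a float's IEEE 754 binary64 representation."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


matrix = [[4.0, 1.0, 2.0], [1.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
plain, weighted = column_checksums(matrix)
matrix[1][2] = flip_bit(matrix[1][2], 51)  # silent corruption: 3.0 becomes 2.0
repaired = detect_and_correct(matrix, plain, weighted)
print(repaired, matrix[1][2])
```

Because the checksums are ordinary floating-point sums, detection has to use a tolerance instead of an exact comparison, and the repaired value is only accurate up to round-off. This is the tension between round-off error and soft error detection that the abstract mentions. A flip that produces infinity or NaN would need separate handling that this sketch omits.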

Similar resources

High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors

In the multi-peta-flop era for supercomputers, the number of computing cores is growing exponentially. However, as integrated circuit technology scales below 65 nm, the critical charge required to flip a gate or a memory cell has been dangerously reduced, causing a higher cosmic-radiation-induced soft error rate. Soft errors threaten computing systems by silently producing data corruption which i...

An Evolutionary Method for Improving the Reliability of Safety-critical Robots against Soft Errors

Nowadays, robots account for a large part of our lives, to the point that it is impossible for us to do many things without them. The application of robots is developing fast, and their functions are becoming more sensitive and complex. One of the important requirements of robot use is reliable software operation. To enhance reliability, it is necessary to design the fault toler...

SASSIFI: Evaluating Resilience of GPU Applications

As GPUs become more pervasive in both scalable high-performance computing systems and safety-critical embedded systems, evaluating and analyzing their resilience will grow increasingly important. As soft errors, such as those caused by high-energy particle strikes, form an important fraction of in-field hardware errors, GPU designers must develop tools and techniques to understand the effect of...

Fault Tolerance in an Inner-Outer Solver: A GVR-Enabled Case Study

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and eva...

Solver Device for Powdery Drugs

Pharmacotherapy is a major treatment method in healthcare centers, and the injection of powdered drugs is among common pharmacotherapy techniques. Medication errors and nosocomial infections are among major health issues in the world. On the other hand, powdered drugs are widely used in hospitals; however, drug mixture is a very time-consuming process. The objective of this invention was to acc...

Publication date: 2011